Characterizing the diversity of MHC conserved extended haplotypes using families from the United Arab Emirates

Aside from its anthropological relevance, the characterization of the allele frequencies of genes in the human Major Histocompatibility Complex (MHC), and of the combinations of these alleles that make up MHC conserved extended haplotypes (CEHs), is necessary for histocompatibility matching in transplantation as well as for mapping disease association loci. The structure and content of the MHC region in Middle Eastern populations remain poorly characterized, posing challenges when establishing disease association studies in ethnic groups that inhabit the region and reducing the capacity to translate genetic research into clinical practice. This study was conceived to address this knowledge gap, aiming to characterize CEHs in the United Arab Emirates (UAE) population through segregation analysis of high-resolution, pedigree-phased MHC haplotypes derived from 41 families. Approximately one fifth (20.6%) of the total haplotype pool derived from this study cohort were identified as putative CEHs in the UAE population. These consisted of CEHs that have been previously detected in other ethnic groups, including the South Asian CEH 8.2 [HLA-C*07:02-B*08:01-DRB1*03:01-DQA1*05:01-DQB1*02:01 (H.F. 0.094)] and the common East Asian CEH 58.1 [HLA-C*03:02-B*58:01-DRB1*03:01-DQA1*05:01-DQB1*02:01 (H.F. 0.024)]. Additionally, three novel CEHs were identified in the current cohort, including HLA-C*15:02-B*40:06-DRB1*16:02-DQB1*05:02 (H.F. 0.035), HLA-C*16:02-B*51:01-DRB1*16:01-DQA1*01:02-DQB1*05:02 (H.F. 0.029), and HLA-C*03:02-B*58:01-DRB1*16:01-DQA1*01:02-DQB1*05:02 (H.F. 0.024). Overall, the results indicate substantial gene flow into the contemporary UAE population from neighbouring ethnic groups, including South Asian, East Asian, African, and European populations. Importantly, alleles and haplotypes that have been previously associated with autoimmune diseases (e.g., Type 1 Diabetes) were also present. In this regard, this study emphasizes that an appreciation for ethnic differences can provide insights into subpopulation-specific disease-related polymorphisms, the identification of which has remained a difficult endeavour.


Methods
Recruitment. Families were approached, briefed on the study, and invited to participate. The cohort also included a subset of five families that have been previously published by Tay et al. 35. Those families included healthy parents and at least one child with Type 1 Diabetes; only the phased haplotypes of the healthy parents were retained for the current study. Families were randomly recruited from different parts of the UAE, including northern, western, eastern, and south-eastern regions. All the participants recruited for the study were UAE nationals. However, no information on sub-ethnic group or country of ancestral origin was collected from the recruited participants.
Ethics declarations. All [...]
Segregation analysis. Segregation analysis by pedigree was conducted independently by the co-authors, and all haplotypes assigned by these individuals were concordant. Each family had identical 8-locus haplotypes (HLA-A-C-B-DRB1-DQA1-DQB1-DPA1-DPB1) by descent. When a parent's genotype was missing, data from at least two non-HLA-identical children were required for the family to be included in the study.
HLA nomenclature. This report follows the latest HLA nomenclature system for reporting and naming HLA alleles and haplotypes 36 . The asterisk "*" denotes molecular typing. The digits before the first colon (field 1) indicate the allele group or type. The subtype is indicated by the next set of digits (field 2), while synonymous variants are indicated by the third set of digits (field 3).
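The field structure described above can be illustrated with a short Python sketch that splits an allele name into its locus and colon-delimited fields and truncates it to a chosen resolution. The function names (`parse_allele`, `truncate_allele`) and the regular expression are assumptions introduced for this example and do not belong to any tool used in the study. An optional trailing expression suffix is parsed but dropped on truncation, which is a simplification.

```python
import re

# Hypothetical pattern for names such as "HLA-A*68:01:02:01" or "DRB1*03:01".
# The "HLA-" prefix and a single-letter suffix are treated as optional.
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z]+[0-9]*)\*"
    r"(?P<fields>\d+(?::\d+){0,3})(?P<suffix>[A-Z]?)$"
)


def parse_allele(name):
    """Return (locus, list of fields, suffix) for an HLA allele name."""
    match = _ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognisable HLA allele name: {name!r}")
    fields = match.group("fields").split(":")
    return match.group("locus"), fields, match.group("suffix")


def truncate_allele(name, n_fields=2):
    """Reduce an allele name to its first n_fields fields (suffix dropped)."""
    locus, fields, _suffix = parse_allele(name)
    return f"{locus}*{':'.join(fields[:n_fields])}"


if __name__ == "__main__":
    # Field 1 = allele group, field 2 = subtype, field 3 = synonymous variant.
    print(parse_allele("HLA-A*68:01:02:01"))
    print(truncate_allele("HLA-A*68:01:02:01", n_fields=2))
```

Truncating every allele to the same number of fields before counting is one way to make datasets typed at different resolutions comparable.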
Population genetic analysis. The samples were genotyped at up to the 4th field of resolution. However, statistical population genetic analysis was limited to the 2nd field of resolution to allow for comparisons with previously published reports in other populations. Allele frequencies (A.F.), the degree of heterozygosity, and the Guo and Thompson test of Hardy-Weinberg equilibrium (HWE) at a locus-by-locus level were computed using Python for Population Genomics (PyPop v.0.7.0) 37. The genetic diversity at the allelic level for the UAE cohort was calculated using polymorphism information content (PIC) and power of discrimination (PD) implemented in the FORSTAT tool 38. Slatkin's implementation of the Ewens-Watterson (EW) homozygosity test of neutrality, implemented in PyPop, was performed to examine the effect of natural selection on HLA loci. The test calculates the normalized deviation of homozygosity (Fnd), which is defined as the difference between observed and expected homozygosity divided by the square root of the variance of the expected homozygosity. Haplotypes HLA-A-C-B-DRB1-DQA1-DQB1, HLA-C-B, HLA-DRB1-DQA1-DQB1 and HLA-DPA1-DPB1 were observed and manually counted by the co-authors using MS Excel.
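To make the summary statistics named above concrete, the following Python sketch computes homozygosity, PIC, PD, and Fnd from their textbook definitions. It is not the PyPop or FORSTAT implementation; all function names are hypothetical. The expected homozygosity and its variance under neutrality are taken as inputs here, since the study obtained them from its software, and the numbers in the usage lines are invented for illustration.

```python
from collections import Counter
from math import sqrt


def allele_frequencies(alleles):
    """Relative frequency of each allele in a list of allele calls."""
    counts = Counter(alleles)
    total = len(alleles)
    return {allele: n / total for allele, n in counts.items()}


def homozygosity(freqs):
    """Observed homozygosity F = sum of squared allele frequencies."""
    return sum(p * p for p in freqs.values())


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    p = list(freqs.values())
    cross = sum(
        2 * p[i] ** 2 * p[j] ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return 1 - sum(x * x for x in p) - cross


def power_of_discrimination(genotypes):
    """PD = 1 - sum of squared genotype frequencies (unordered pairs)."""
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    total = len(genotypes)
    return 1 - sum((n / total) ** 2 for n in counts.values())


def fnd(f_obs, f_exp, var_exp):
    """Normalized deviation of homozygosity, as defined in the text."""
    return (f_obs - f_exp) / sqrt(var_exp)


if __name__ == "__main__":
    # Toy data only; these are not alleles or values from the study.
    freqs = allele_frequencies(["a1", "a1", "a2", "a2"])
    print(homozygosity(freqs), pic(freqs))
    print(fnd(f_obs=0.5, f_exp=0.4, var_exp=0.01))
```

A negative Fnd means the observed homozygosity is lower than expected under neutrality, and a positive Fnd means it is higher.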

MHC conserved extended haplotypes (CEHs).
Putative CEHs (extending from HLA-C to HLA-DQB1) were identified through a previously described and established approach 3,8,13,15,19. A haplotype frequency cut-off of 0.005 is usually used to distinguish a common CEH in a certain population, considering the high level [...] haplotypes (HLA-A-C-B-DRB1-DQA1-DQB1-DPA1-DPB1) obtained from the segregation analysis were sorted based on the HLA-B, HLA-DRB1, and HLA-DQB1 loci, respectively, using Microsoft Excel. Next, 5-locus haplotypes (HLA-C-B-DRB1-DQA1-DQB1) [...] Haplotypes that were observed at least 5 times were extracted for further analysis of CEH. Novel CEHs were named according to the system previously described by Degli-Esposti et al. 19, in which the CEH is identified by its HLA-B allele type, followed by a sequential number indicating its order of discovery (e.g., 18 [...])
Analysis of genetic relationships with other populations. A Principal Component Analysis (PCA) plot and a phylogenetic tree were generated for 50 populations, including the cohort studied herein, with high-resolution genotypes of HLA-A, HLA-B, and HLA-DRB1. Those loci were chosen as they exhibit the greatest level of heterogeneity, effectively representing world populations while simultaneously expanding the number of datasets available for the analysis. The world population datasets were obtained from the Allele Frequency Net Database (AFND) 39. The populations were selected from different world regions including the Middle East, Central and South Asia, Sub-Saharan Africa, North Africa, Oceania, South America, East Asia, and Europe. Datasets were included only if they satisfied the gold or silver quality standard based on AFND criteria 39. The PCA was conducted using IBM SPSS Statistics 19 software (IBM Corporation, Armonk, NY, USA). The phylogenetic tree was constructed using the neighbour-joining (NJ) clustering method implemented in POPTREEW. The distance was set to Nei's genetic distance (DA), and the bootstrap to 1,000 replications.
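For readers unfamiliar with the distance measure named above, the following Python sketch computes Nei's DA distance between two populations from per-locus allele frequencies, using the standard definition DA = 1 - (1/r) Σ_j Σ_i sqrt(x_ij · y_ij) over r loci. It is a minimal illustration, not the POPTREEW implementation. The function name `nei_da`, the nested-dictionary input format, and the toy frequencies are assumptions made for this example.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's DA distance between two populations.

    pop_x, pop_y: dict mapping locus -> dict mapping allele -> frequency.
    Only loci present in both populations are used.
    """
    loci = sorted(set(pop_x) & set(pop_y))
    if not loci:
        raise ValueError("the two populations share no loci")
    total = 0.0
    for locus in loci:
        fx, fy = pop_x[locus], pop_y[locus]
        # Alleles absent from either population contribute sqrt(0) = 0.
        total += sum(sqrt(fx[a] * fy[a]) for a in set(fx) & set(fy))
    return 1.0 - total / len(loci)


if __name__ == "__main__":
    # Toy frequencies for illustration only; not data from the study.
    pop_1 = {"HLA-B": {"08:01": 0.5, "58:01": 0.5}}
    pop_2 = {"HLA-B": {"08:01": 1.0}}
    print(round(nei_da(pop_1, pop_2), 4))
```

A pairwise matrix of such distances is the kind of input a neighbour-joining procedure works from; bootstrap support is then obtained by resampling loci and rebuilding the tree.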

Results
HLA allele and MHC haplotype frequencies: genetic similarity with other populations. The current cohort included 40 two-generation and one three-generation families from the UAE (see Table S1).
In total, 170 phased HLA-A-C-B-DRB1-DQA1-DQB1-DPA1-DPB1 haplotypes were described by segregation analysis. Ten haplotypes were obtained from the three-generation family (referred to as HF8): 4 from the grandparents and 6 from 3 individuals who married into the family. Ambiguities and allelic dropout in parental genotypes were resolved by inference from offspring. One HLA-DQA1 genotype and three HLA-DQB1 genotypes were missing owing to sequencing error. HLA class I and class II allele counts and frequencies are listed in Tables 1 and 2. Overall, no deviation from HWE was observed except for HLA-DQB1 (Table S2). The PIC and PD for HLA-A, HLA-C, HLA-B, HLA-DRB1, HLA-DQA1, and HLA-DQB1 were calculated to measure the extent of genetic diversity within the cohort (Table S2). The HLA class I loci were more diverse than the HLA class II loci; HLA-B was the most polymorphic locus (PIC 0.94) and HLA-DQA1 the least polymorphic (PIC 0.82). A PD value greater than 0.80 is indicative of a high degree of polymorphism 40. The results of the EW homozygosity test of neutrality are summarized in Table S3. A large negative Fnd value suggests balancing selection, while a strong positive value implies directional selection. Only the HLA-DRB1 locus showed slight directional selection. The two loci HLA-DPA1 and HLA-DPB1 were excluded from the HWE, PIC, PD, and EW homozygosity analyses.
The PCA plot in Fig. 1 shows that the UAE clusters with the Omani population (abbreviated as 'Oma') and the Baloch subpopulation of Iran (abbreviated as 'IrB'), followed by South American and European populations (with some proximity to East Asian populations). Similarly, the phylogenetic tree in Fig. 2 shows that the UAE population is genetically close to the Baloch subpopulation of Iran. The description and reference for each population dataset used in the PCA and phylogenetic tree are listed in Table S7.
Identification of HLA conserved extended haplotypes. The complete list of the phase-segregated 5-locus MHC haplotypes (HLA-C-B-DRB1-DQA1-DQB1) observed in the current UAE cohort is presented in Table S8. To allow for a more rigorous identification of MHC CEHs in the UAE population, only CEHs with H.F. > 0.02 are described and discussed hereafter (see Table 6). When combined, these five CEHs represent 20.6% (35 out of 170) of the haplotype pool in the current UAE cohort. Subsequently, these CEHs were analyzed to infer their most probable ancestry (MPA) based on previously published frequencies in African, Asian, and Caucasian populations 41. MPA is based on evaluating the existence of distinctive ethnic/region-specific CEHs in the relevant continental group, such that CEHs that are generally present at high frequency (e.g., H.F. > 0.10) in a particular non-recently admixed human continental group were regarded as indicative of that regional origin. Table S9 provides the names for the CEHs observed in the study.
Table 8
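The counting step behind these frequencies can be sketched in Python as follows: each phased 8-locus haplotype is projected onto the 5 loci from HLA-C to HLA-DQB1, identical projections are counted, and those above a frequency cut-off are kept. The authors counted haplotypes manually in a spreadsheet, so this is only an illustration of the arithmetic. The dictionary representation and the function names are assumptions, and haplotypes with a missing locus would need extra handling that is not shown.

```python
from collections import Counter

# Loci spanned by the 5-locus haplotypes discussed in the text.
LOCI_5 = ("C", "B", "DRB1", "DQA1", "DQB1")


def project(haplotype, loci=LOCI_5):
    """Reduce a haplotype (dict locus -> allele) to a tuple over `loci`."""
    return tuple(haplotype[locus] for locus in loci)


def haplotype_frequencies(haplotypes, loci=LOCI_5):
    """Map each projected haplotype to (count, frequency in the pool)."""
    counts = Counter(project(h, loci) for h in haplotypes)
    total = len(haplotypes)
    return {hap: (n, n / total) for hap, n in counts.items()}


def select_candidates(haplotypes, min_hf=0.02, loci=LOCI_5):
    """Keep projected haplotypes whose frequency exceeds `min_hf`."""
    freqs = haplotype_frequencies(haplotypes, loci)
    return {hap: cf for hap, cf in freqs.items() if cf[1] > min_hf}


if __name__ == "__main__":
    # Aggregate share reported in the text: 35 of 170 haplotypes.
    print(round(35 / 170, 3))  # 0.206
```

Passing a different `loci` tuple, such as `("C", "B")`, gives the counts for the shorter haplotype blocks in the same way.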

Discussion
The first whole-genome analysis of two UAE nationals 42,43 has provided insights into the genomic structure and the putative genetic origins of the UAE population. Following that, a comprehensive, large-scale stratification study of the UAE population concluded that genetic admixture throughout the Arabian Peninsula's eastern shore and south-eastern tip happened gradually and without clear social stratification boundaries 43. This study and a mitogenome study 44 have shown that there was no apparent association between birthplace and ancestral background, indicating that the contemporary UAE population developed over generations prior to the establishment [...]
Conserved extended haplotypes (CEHs) of the MHC, and their fragments, have been shown to be useful as markers for disease association, immune response, and anthropology. This study describes the diversity of MHC CEHs derived from 41 UAE families. As in the previously cited publications, the data presented herein suggest gene flow from neighbouring ethnic groups into the contemporary UAE population.
The current study detected 5 putative CEHs in the UAE cohort, three of which were identified as novel CEHs. Overall, these 5 putative CEHs accounted for 20.6% of the haplotype pool.
As noted earlier, HLA-B is the most polymorphic HLA locus. Thus, individual CEHs will be discussed hereafter based on the relevance of the HLA-B allele each CEH contains.
Of the total number of HLA-C*07:02-B*08:01-DRB1*03:01-DQA1*05:01-DQB1*02:01 CEHs observed, 25% were extended to HLA-A*68:01. The association of the 8.2 CEH with the HLA-A*68:01 allele has not been reported in South Indians. Nonetheless, the HLA-A*68:01 allele has been found to be highly prevalent in Native Americans 48 and Africans 49, whereas it is found at low levels in Southeast Asia 50. A genome-wide study of populations of the Arabian Peninsula demonstrated a Sub-Saharan African input of only 4.0% by 1754 Common Era (CE) in a cohort from the UAE 51. Therefore, it can be argued that HLA-A*68:01 was introduced to the UAE from a Sub-Saharan founder, considering that both West and East African populations were transported to the Middle East, Arabia, and the Indian Ocean between the 15th and 19th centuries, when the slave trade was common 52. HLA-A*68:01 is of particular interest due to several unusual features, such as its weak binding affinity to CD8 and its ability to bind unusually long peptides because of peptide bending in the binding groove 53.
Overall, 88.9% of the HLA-B*08:01 alleles observed were part of CEHs identified in South Asians 46,47,54. However, one family (Family ID: HF11) carried the Caucasian 8.1 CEH, implying a possible Caucasian origin (Table 7).
According to the IMGT/HLA database, HLA-B*40 is one of the most polymorphic lineages of HLA antigens 55 [...] 15,57.
The HLA-B*51 allele is considered the risk factor for Behçet's disease, a disease that has a strong geographical prevalence distribution along the ancient Silk Road, which ran from the Mediterranean to Northern China 58. Accordingly, the prevalence of Behçet's disease is highest among populations of Japan, China, Korea, Turkey, Iran, Tunisia, and other Middle Eastern countries 59, whereas it is low in Africa, Oceania, and South America, where the frequency of the HLA-B*51 allele is low 60.
The East Asian 58.1 CEH and its recombinants were also observed at high frequency in people from the Arabian Peninsula 31,32,39, as well as South Asia 46, but not in Caucasians 8, indicating a possible genetic link with populations from East Asia. This is supported by historical documents indicating that bidirectional trade movements from Central and South Asia through the Arabian Gulf into the Arabian Peninsula's south-eastern region, which currently includes the UAE, were feasible and did occur 62. Furthermore, as shown by autosomal Short Tandem Repeat (STR) genotyping, this cultural diffusion from Arabia has shaped Muslim populations in Asia, including the Thai-Malay 63 and Chinese Muslim populations 64. In addition, analyses of autosomal STRs 65, mitochondrial DNA 66, and Y-chromosomes 67 have revealed that historically attested movements into the Indian subcontinent have accounted for a cultural diffusion as well as a minor but detectable gene flow from West Asia and Arabia.
Natural selection 3,8,17 is often considered an important component in the evolution of the MHC and the production of CEHs. However, as evidenced by the information presented here and by other reports 42,43,44, it seems that the MHC genomic landscape of contemporary UAE nationals must also have been shaped both by transcontinental migration between Africa, Asia, and Europe, which involved a diverse array of ethnic groups 34,51, and by the nomadic lifestyles of some Arabian communities, notably the Bedouins.
HLA allele frequencies, used as genetic estimators, have been shown to reproduce the results obtained with genome-wide data in PCA 6. High-resolution, high-quality HLA allele frequency data from Middle Eastern populations were scarce for the current study, which may have resulted in an imbalance of the clustering pattern in the PCA plot (Fig. 1) and the phylogenetic tree (Fig. 2). The PCA and the phylogenetic tree seem to yield qualitatively different findings regarding the genetic relationship between the current UAE dataset and world populations. Additionally, the identified CEHs and their ethnic identities observed in the current cohort do not seem to correlate with the results of the PCA plot or the phylogenetic tree. We argue that determining the direction of gene flow at the CEH level (whether from East to West or vice versa) requires additional evaluations of the whole Asian continent from the Arabian Peninsula to north-eastern Siberia, and from the northern Urals to Southeast Asia.
High-resolution HLA typing and haplotyping are critical in hematopoietic stem cell transplantation for both unrelated and related donors, particularly in reducing post-transplantation adverse outcomes 68,69. A single high-resolution HLA mismatch may have the same negative effect on outcomes as a low-resolution one 70,71. As a result, high-resolution HLA typing has been proposed to lower the probability of missing a clinically important mismatch 68. To this end, the data presented herein provide a framework for donor selection during organ and bone marrow transplantation, as well as for the identification of permitted mismatches and disease risk markers.
Previously, results generated from this laboratory on UAE families with Type 1 Diabetes identified two CEHs (namely 8.2 and 50.2) that have been previously associated with the disease in a neighbouring Indian population 54. Likewise, several alleles and CEHs associated with autoimmunity and related conditions in other genetically related populations have been identified at high frequency in the current cohort. In this context, further research could be directed at evaluating established associations between HLA and autoimmune diseases in Arabs using pedigree-based analysis. For example, all the Indian 8.2 CEHs identified herein were intact and therefore present a good model for recombination and disease association mapping.
Further investigation can be carried out with a larger sample size, together with genotyping of additional marker sets, including non-HLA genes (e.g. MICA, MICB, TNF, C2, Bf, C4, among others), microsatellite markers, and polymorphic Alu insertions (POALINs) 72,73,74 across the MHC of the UAE population, to ascertain the degree of similarity to other haplotypes of the same CEH blocks, measure the sizes of DNA blocks that may be fixed, and map the recombination hotspots.

Conclusion
Despite being based on a limited number of haplotypes, this preliminary report identified conserved extended HLA haplotypes in the UAE population and presented evidence of shared CEHs between the UAE Arab population and other neighbouring populations. To the best of our knowledge, this is the first attempt to identify CEHs in Arabs using high-resolution HLA pedigree-phased haplotypes.